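Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data is still an unsolved problem due to the lack of available datasets, models, and standard evaluation metrics. To address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis of BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, the Cascaded Motion Network (CaMN), in which the above six modalities are modeled in a cascaded architecture for gesture synthesis. To evaluate semantic relevancy, we introduce a metric, Semantic Relevance Gesture Recall (SRGR). Qualitative and quantitative experiments demonstrate the validity of the metric, the quality of the ground-truth data, and the state-of-the-art performance of the baseline. To the best of our knowledge, BEAT is the largest motion capture dataset for investigating human gestures, which may contribute to a number of different research fields, including controllable gesture synthesis, cross-modality analysis, and emotional gesture recognition. The data, code, and model are available at https://pantomatrix.github.io/beat/.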
Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
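As a rough illustration of query-conditioned separation, the sketch below weights a few candidate time-frequency masks by a query vector and applies the combined mask to a mixture spectrogram. The mask-combination rule, the function names, and the toy numbers are all assumptions made for illustration; none of them is taken from the CLIPSep implementation, and no CLIP encoder is involved.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def query_conditioned_mask(candidate_masks, query_vector):
    """Combine K candidate masks (each F x T) into one soft mask.

    Hypothetical rule: the query vector supplies one weight per candidate
    mask, and the weighted sum is squashed with a sigmoid.
    """
    n_freq = len(candidate_masks[0])
    n_time = len(candidate_masks[0][0])
    mask = []
    for f in range(n_freq):
        row = []
        for t in range(n_time):
            logit = sum(q * m[f][t] for q, m in zip(query_vector, candidate_masks))
            row.append(sigmoid(logit))
        mask.append(row)
    return mask


def apply_mask(mixture, mask):
    """Element-wise product of a magnitude spectrogram and a soft mask."""
    return [[x * w for x, w in zip(mix_row, mask_row)]
            for mix_row, mask_row in zip(mixture, mask)]


# Toy 2 x 2 "spectrogram": row 0 is a low band, row 1 is a high band.
mixture = [[1.0, 1.0], [2.0, 2.0]]
candidate_masks = [
    [[5.0, 5.0], [-5.0, -5.0]],   # favours the low band
    [[-5.0, -5.0], [5.0, 5.0]],   # favours the high band
]

# In a real system the query vector would come from a text or image encoder.
query_low = [1.0, 0.0]
query_high = [0.0, 1.0]

separated_low = apply_mask(mixture, query_conditioned_mask(candidate_masks, query_low))
separated_high = apply_mask(mixture, query_conditioned_mask(candidate_masks, query_high))
```

Changing only the query vector changes which band survives, which is the property that lets a model trained with image queries be queried with text when both share one embedding space.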
Automation of berthing maneuvers in shipping is a pressing issue as the berthing maneuver is one of the most stressful tasks seafarers undertake. Berthing control problems are often tackled via tracking a predefined trajectory or path. Maintaining a tracking error of zero under an uncertain environment is impossible; the tracking controller is nonetheless required to bring vessels close to desired berths. The tracking controller must prioritize the avoidance of tracking errors that may cause collisions with obstacles. This paper proposes a training method based on reinforcement learning for a trajectory tracking controller that reduces the probability of collisions with static obstacles. Via numerical simulations, we show that the proposed method reduces the probability of collisions during berthing maneuvers. Furthermore, this paper shows the tracking performance in a model experiment.
Our team, Hibikino-Musashi@Home (HMA for short), was founded in 2010. It is based in the Kitakyushu Science and Research Park, Japan. We have participated in the open platform league of the RoboCup@Home Japan Open competition every year since 2010. Moreover, we participated in RoboCup 2017 Nagoya as open platform league and domestic standard platform league teams. Currently, the Hibikino-Musashi@Home team has 20 members from seven different laboratories based in the Kyushu Institute of Technology. In this paper, we introduce our team's activities and technologies.
Wireless ad hoc federated learning (WAFL) is a fully decentralized collaborative machine learning framework organized by opportunistically encountered mobile nodes. Compared to conventional federated learning, WAFL performs model training by weakly synchronizing the model parameters with others, and this shows great resilience to a poisoned model injected by an attacker. In this paper, we provide a theoretical analysis of WAFL's resilience against model poisoning attacks by formulating the force balance between the poisoned model and the legitimate model. Our experiments confirm that the nodes that directly encountered the attacker were compromised to some extent by the poisoned model, while the other nodes showed great resilience. More importantly, after the attacker left the network, all the nodes eventually found stronger model parameters that incorporated the poisoned model. Most of the attack-experienced cases achieved higher accuracy than the no-attack-experienced cases.
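Weak synchronization can be pictured as each node moving only a fraction of the way toward the models it encounters. The following is a minimal sketch, assuming an update of the form w ← w + λ · Σ(w_k − w) / (n + 1) over the n encountered models; the exact rule, the coefficient value `lam`, and all names are assumptions for illustration, not details stated in the abstract.

```python
def weak_sync(own, neighbors, lam=0.1):
    """Move `own` parameters a small step toward the encountered models.

    own:       list of floats (this node's parameters)
    neighbors: list of parameter lists from nodes currently in contact
    lam:       assumed mixing coefficient; small values mean weak coupling
    """
    if not neighbors:
        return list(own)
    scale = lam / (len(neighbors) + 1)
    return [w + scale * sum(nb[i] - w for nb in neighbors)
            for i, w in enumerate(own)]


legit = [1.0, 1.0]
poisoned = [-9.0, -9.0]

# One contact with the attacker pulls the node only part of the way.
after_attack = weak_sync(legit, [poisoned], lam=0.1)

# Later contacts with two legitimate nodes pull it back toward them.
recovered = weak_sync(after_attack, [legit, legit], lam=0.1)
```

With a small `lam`, one encounter with a poisoned model shifts the parameters only partially, and later encounters with legitimate nodes pull them back. This is one way to read the "force balance" between poisoned and legitimate models.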
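In this paper, we propose a model to perform speech-to-singing voice conversion. Contrary to prior signal-processing-based methods, which require high-quality singing templates or phoneme synchronization, we explore a data-driven approach to the problem of converting natural speech to singing voice. We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody while preserving the speaker identity and naturalness. The proposed SymNet model comprises a symmetrical stack of three types of layers: convolutional, transformer, and self-attention layers. The paper also explores novel data augmentation and generative loss annealing methods to facilitate model training. Experiments are performed on the NUS and NHSS datasets, which consist of parallel data of speech and singing voice. In these experiments, we show that the proposed SymNet model significantly improves the objective reconstruction quality over previously published methods and baseline architectures. Further, a subjective listening test confirms the improved quality of the audio obtained using the proposed approach (an absolute improvement of 0.37 in mean opinion score over the baseline system).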
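Recently, a variety of vision transformers have been developed for their capability of modeling long-range dependencies. In current transformer-based backbones for medical image segmentation, convolutional layers are replaced with pure transformers, or transformers are added to the deepest encoder to learn global context. However, from a scale-wise perspective there are two main challenges: (1) the intra-scale problem: existing methods fall short in extracting local-global cues within each scale, which may affect the signal propagation of small objects; (2) the inter-scale problem: existing methods fail to explore distinctive information from multiple scales, which may hinder representation learning for objects with widely varying size, shape, and location. To address these limitations, we propose a novel backbone, namely ScaleFormer, with two appealing designs: (1) a scale-wise intra-scale transformer is designed to couple the CNN-based local features with the transformer-based global cues in each scale, where the row-wise and column-wise global dependencies can be extracted by a lightweight Dual-Axis MSA; (2) a simple and effective spatial-aware inter-scale transformer is designed to interact among consensual regions across multiple scales, which can highlight cross-scale dependencies and resolve complex scale variations. Experimental results on different benchmarks demonstrate that our ScaleFormer outperforms the current state-of-the-art methods. The code is publicly available at: https://github.com/zjugivelab/scaleformer.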
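This paper proposes a method for visually explaining the decision-making process of 3D convolutional neural networks (CNNs) with a temporal extension of occlusion sensitivity analysis. The key idea is to occlude a specific volume of data with a 3D mask in the input 3D spatio-temporal data space and then measure the degree of change in the output score. Occluded volume data that produces a larger change is regarded as a more critical element for classification. However, while occlusion sensitivity analysis is commonly used to analyze single-image classification, applying this idea to video classification is not so straightforward, because a simple fixed cuboid cannot deal with motion. To this end, we adapt the shape of the 3D occlusion mask to the complicated motions of target objects. Our flexible mask adaptation is performed by considering the temporal continuity and spatial co-occurrence of the optical flows extracted from the input video data. We further propose to approximate our method by using the first-order partial derivative of the score with respect to an input image to reduce its computational cost. We demonstrate the effectiveness of our method through comparisons with conventional methods in terms of the deletion/insertion metric and the pointing metric on UCF-101. The code is available at: https://github.com/uchiyama33/aosa.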
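The basic occlusion step can be sketched as follows: slide a fixed 3D cuboid over a (time, height, width) volume, blank it out, and record how much the score drops. This is a minimal sketch of the plain fixed-cuboid baseline only; the motion-adaptive mask and the gradient approximation described in the abstract are not implemented, and the toy `score` function stands in for a real 3D CNN as an assumption.

```python
def score(video):
    """Toy stand-in for a classifier score (assumption, not a real 3D CNN).

    It only looks at frames 2-3, rows 0-1, columns 0-1.
    """
    return sum(video[t][y][x] for t in (2, 3) for y in (0, 1) for x in (0, 1))


def occlude(video, t0, y0, x0, size, fill=0.0):
    """Return a copy of `video` with a cuboid starting at (t0, y0, x0) filled."""
    dt, dy, dx = size
    out = [[row[:] for row in frame] for frame in video]
    for t in range(t0, min(t0 + dt, len(out))):
        for y in range(y0, min(y0 + dy, len(out[t]))):
            for x in range(x0, min(x0 + dx, len(out[t][y]))):
                out[t][y][x] = fill
    return out


def occlusion_sensitivity(video, score_fn, size):
    """Map each cuboid origin to the score drop caused by occluding it."""
    base = score_fn(video)
    dt, dy, dx = size
    n_t, n_y, n_x = len(video), len(video[0]), len(video[0][0])
    drops = {}
    for t0 in range(0, n_t, dt):
        for y0 in range(0, n_y, dy):
            for x0 in range(0, n_x, dx):
                occluded = occlude(video, t0, y0, x0, size)
                drops[(t0, y0, x0)] = base - score_fn(occluded)
    return drops


# 4 frames of 4 x 4 pixels, all ones.
video = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
sensitivity = occlusion_sensitivity(video, score, size=(2, 2, 2))
most_important = max(sensitivity, key=sensitivity.get)
```

Here the cuboid that covers the region the toy score depends on produces the largest drop. A fixed cuboid like this one cannot follow a moving object across frames, which is the limitation the adaptive mask is meant to address.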
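We approximate the evaluation function for the game Tic-Tac-Toe by singular value decomposition (SVD) and investigate the effect of approximation accuracy on the winning rate. We first prepared the perfect evaluation function of Tic-Tac-Toe and performed low-rank approximation by treating the evaluation function as a ninth-order tensor. We found that we can reduce the amount of information in the evaluation function by 70% without significantly degrading performance. Approximation accuracy and winning rate were strongly correlated but not perfectly proportional. We also investigated how the decomposition method of the evaluation function affects performance. We considered two decomposition methods: simple SVD, regarding the evaluation function as a matrix, and the Tucker decomposition by higher-order SVD (HOSVD). At the same compression ratio, the strategy with the approximated evaluation function obtained by HOSVD exhibited a significantly higher winning rate than the one obtained by SVD. These results suggest that SVD can effectively compress board game strategies and that there is an optimal compression method that depends on the game.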
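To give a feel for the compression ratios involved, the sketch below counts stored parameters for a ninth-order tensor with three states per cell, under a truncated matrix SVD and under a Tucker decomposition. The 81 x 243 matrix reshaping, the chosen ranks, and the convention of storing singular values separately are assumptions for illustration; the abstract does not specify them.

```python
def full_size(order=9, states=3):
    """Number of entries in a tensor with `order` modes of size `states`."""
    return states ** order


def svd_params(rows, cols, rank):
    """Stored numbers for a rank-`rank` truncated SVD of a rows x cols matrix.

    Counts U (rows x rank), V (cols x rank), and `rank` singular values.
    """
    return rank * (rows + cols + 1)


def tucker_params(order, states, rank):
    """Stored numbers for a Tucker decomposition with the same rank per mode.

    Counts the core tensor (rank ** order) plus one states x rank factor
    matrix for each mode.
    """
    return rank ** order + order * states * rank


total = full_size()                  # 3 ** 9 entries
rows, cols = 3 ** 4, 3 ** 5          # assumed 81 x 243 reshaping

svd_ratio = svd_params(rows, cols, rank=18) / total
tucker_ratio = tucker_params(order=9, states=3, rank=2) / total

print(f"full tensor entries: {total}")
print(f"rank-18 SVD keeps {svd_ratio:.1%} of the entries")
print(f"rank-2 Tucker keeps {tucker_ratio:.1%} of the entries")
```

Under these assumptions a rank-18 truncation of the 81 x 243 matrix stores roughly 30% of the original entries, which is on the order of the 70% reduction mentioned above. Whether that rank matches the reported setting is not stated in the abstract.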
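This paper presents SCALER, a quadrupedal robot that demonstrates climbing on bouldering walls, overhangs, and ceilings, as well as trotting on the ground. SCALER is one of the first high-degree-of-freedom four-limbed robots that can free-climb under Earth's gravity, and one of the most mechanically efficient quadrupeds on the ground. Where other state-of-the-art climbers specialize in climbing, SCALER promises practical free-climbing with payload \textit{and} ground locomotion, which realizes truly versatile mobility. A new climbing gait, the SKATE gait, increases the payload by utilizing the SCALER body linkage mechanism. SCALER achieves a maximum normalized locomotion speed of $1.87$ /s, or $0.56$ m/s, on the ground and $1.2$ /min, or $0.42$ m/min, in bouldering wall climbing. Payload capacity reaches $233$% of the SCALER weight on the ground and $35$% on the vertical wall. Our GOAT gripper, a mechanically adaptable two-finger gripper, successfully grasps convex and non-convex objects and supports SCALER.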